Simulation and analysis of in vitro DNA evolution 
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We study theoretically the in vitro evolution of a DNA sequence by binding to a transcription 
factor. Using a simple model of protein-DNA binding and available binding constants for the 
Mnt protein, we perform large-scale, realistic simulations of evolution starting from a single DNA 
sequence. We identify different parameter regimes characterized by distinct evolutionary behaviors. 
For each regime we find analytical estimates which agree well with simulation results. For small 
population sizes, the DNA evolutional path is a random walk on a smooth landscape. While for 
large population sizes, the evolution dynamics can be well described by a mean-field theory. We 
also study how the details of the DNA-protein interaction affect the evolution. 



The concept of evolution has not only fundamentally 
shaped our view of biology, but also found rich and pro- 
found applications in bioengineering and biotechnology. 
In particular, in vitro evolution has been widely used to 
evolve DNA RNA 0] and proteins y]. In evolution, 
mutations at the molecular level are selected at the func- 
tional level. A quantitative theory of an evolutionary 
process would reqire a quantitative understanding of the 
selection process (e.g. fitness function, landscape, selec- 
tion pressure, etc.). While this is in general difficult to 
achieve for natural or laboratory evolution, there are sim- 
ple cases where such a quantitative description is readily 
available. Based on experiments of RNA virus evolu- 
tion Q , Levine and colleagues 0. IE studied a simple 
model of evolutionary process in a smooth landscape in 
which the fitness of an individual is given as the sum of 
many individual contributions that can be mutated inde- 
pendently. Their studies found good agreement between 
theory and simulations for small population sizes and for 
equilibrium mean field theory, but the evolution dynam- 
ics turned out to be pathological in large population sizes 
where extremely rare mutations are exponentially ampli- 
fied, yielding infinite speed of evolution. Peng et ai. 
proposed DNA binding to proteins as a model system 
for evolution in a smooth landscape and studied a model 
where a large population of long DNA molecules were 
subjected to high mutation rates and selected by how 
strongly they bind to a protein (the histone-octamer was 
mentioned as a possible example). Their model can be 
well described by a continuum equation and they have 
shown that the average distance to the highest affinity se- 
quence exponentially approaches its equilibrium value. Q 

In a recent experiment, Dubertret et al. studied 
the in vitro evolution of DNA sequences via binding to 
the lac repressor protein which is kept unchanged. In 
the experiment, a round of evolution consists of ampli- 
fication and mutation of the DNA pool by error-prone 



PCR (Polymerase Chain Reaction) followed by selecting 
a fixed fraction of the DNA sequences via washing off 
relatively weak protein binders. Starting from a mix of 
random DNA sequences and by sequencing a number of 
DNA sequences after each round, they were able to ob- 
serve the dynamics of DNA population evolving towards 
the wild type (WT) sequence. In such a molecular breed- 
ing experiment, the relation between the genotype (DNA 
sequence) and the phenotype (binding affinity to a pro- 
tein) is direct and simple. If the binding constants are 
known for various DNA sequences, the selection process 
can be modeled quantitatively and it can then serve as a 
model system for quantitative analysis of molecular evo- 
lution. 

In this paper, we study theoretically the in vitro evolu- 
tion of DNA sequences via binding to the mnt repressor 
protein. The system of DNA-mnt is perhaps the best 
experimentally characterized system of sequence-specific 
DNA-protein binding, [ToL llll IT^| and is particularly 
suited for a thorough quantitative study of molecular 
evolution. Specifically, the binding constants of the WT 
DNA sequence and many other sequences, including all 
the sequences one point-mutation away from the WT, are 
measured experimentally. |lOj Furthermore, it has been 
demonstrated that the binding energy of a sequence can 
be approximately decomposed as the sum of the contri- 
butions from the individual bases. 

[HG3 This additive 
form of binding energy greatly simplifies the analysis — 
it enables us to perform realistic large scale simulations 
as well as to obtain analytic solutions and estimates in 
various cases. In our study, we explore various regimes 
of experimentally accessible parameters and we find very 
distinct evolution dynamics in different regimes. 



I. MODELS AND METHODS 

We assume that the binding energy between a DNA 
molecule and the mnt protein is given by the sum of the 
contributions from individual base pairs, 
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where Si G A, C, G, T is the base at the i-th position of 
the DNA sequence and is the contribution to the 

binding energy from the i-th position for which we use 
the experimentally determined value in Ref. |ldj . The 
relative binding constants are then K(S) = ]X Ki(Si) = 



n, 



We start with a population size N of DNA 



molecules of the same sequence that is significantly dif- 
ferent from the WT, avoiding the potential problem of 
enrichment dominating the evolution [t| (i.e. sequences 
very close to the WT in the initial pool being amplified 
exponentially). An iteration of the evolutionary process 
consists of an amplification with mutation followed by 
a selection. During amplification the population is dou- 
bled I times (corresponding to, e.g. I cycles of PCR), so 
that the population size is increased to 2 1 N. We assume 
that at each duplication there is an error rate of r per 
base (see the appendix for more on this). The popula- 
tion is then subject to a selection process via binding to 
the mnt protein. Each DNA molecule is selected with a 
probability 1+e ^ ( i (S )-n) ; where the chemical potential \i 
is chosen such that the expected number of selected DNA 
molecules is N. We iterate the evolutionary process un- 
til at least 90% of the DNA molecules are WT, and we 
denote the number of iterations required ty[- 

Simulation. The binding site for the mnt repressor con- 
sists of 17 important base pairs, at positions 3 through 
19. t For our starting sequence, we chose, by random 
mutations from WT, a sequence that differs from WT at 
?7i = 6 positions. We denoted by SS the starting sequence 
(Table [J. A simulation starts with N copies of SS. We 
call a specific sequence of mutations that take an SS to 
the WT an evolution path. Each path contains the six 
required mutations in some order, and may contain ad- 
ditional mutations. * There are 6! = 720 minimum paths 
that only contain the six required mutations. We iterate 
the evolutionary process until at least 90% of the DNA 
molecules are WT. We denote the number of iterations 
required tj^ and record the number of molecules coming 
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SS base 
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A A' 


2.13 
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5.0 
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8.3 



TABLE I: The starting sequence SS. AK = e~ pAE is the 
ratio of the binding constants due to the mutation at the 
specific position. 
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FIG. 1: The average fraction of WT contributed by the best 
path, the total fraction from all minimum paths, and the num- 
ber of different minimum paths used, for different population 
sizes N. The parameters r — 10~ 4 (solid lines), r — 10~ 7 
(dashed lines), and 7 = 1 are held fixed. The various regimes 
(random walk, middle ground, and crossover to mean field) 
are indicated. 



from each different path. 



II. RESULTS 



* In addition to this sequence-specific binding, DNA molecules 
may bi n d to th e transcription factor in a non-specific man- 
ner [3 El El, Ktot(S) = K(S) + K NSB , and this binding 
dominates for DNA sequences that are very far from WT. In 
this paper, we only consider DNA sequences that are sufficiently 
close to WT so that the non-specific binding can be neglected — 
we still include it as a parameter in our simulations, but ignore 
it in the analytical estimates. See the appendix for more on 
non-specific binding. 

T The mnt repressor is a dimer of two identical halves, which causes 
the WT to be palindromic around position 11. The actual wild 
type sequence differs from the sequence with the highest affinity 
at one position (19), we denote the highest affinity sequence by 
WT for convenience. ITU 

J If a molecule is mutated at more than one position during a single 
duplication cycle, we randomly assign an order to these muta- 
tions. As a precaution, we have also performed simulations where 
no more than one mutation per duplication cycle per molecule 
was allowed, and verified that there were no qualitative differ- 
ences. 



Some of the key quantities from a single simulation run 
are the fraction of WT that was produced through min- 
imum paths, /jJJjq , the number of minimum paths used, 
n m ; n , and the fraction of the WT produced through the 
single best path in the simulation, f^S- Fig. ^ shows 
how these quantities depend on the population size (av- 
eraged over many simulations). We see that is small 
for very small N and is very close to 1 for large N, with 
a fairly sharp transition, whereas /3£ slowly decreases 
from 1 with increasing N. This indicates that we may 
expect to find qualitatively different behavior for small 
and large population sizes. 

Small N: Random Walk. Let us first consider the 
case of N = 1 and 7=1, i.e. a single DNA molecule is 
duplicated once and one of the two molecules is selected 
at each iteration. If there are no mutations during the 
duplication, the two molecules are identical and nothing 



interesting can happen. If there is a mutation of type§ 
i (the chance of multiple mutations in the same duplica- 
tion is negligible), the binding constant of the copy will 
be AKi times that of the original, and the chance of se- 
lecting the mutant is X ^^ K . , which is high for favorable 
mutations (AKi > 1) an d low for unfavorable ones. The 
DNA molecule will thus perform a biased random walk: 
The molecule will make a step whenever a mutant is se- 
lected, and the steps that improve the binding to the 
protein are favored over the steps that reduce the bind- 
ing. Considering the probability of making each step, we 
find that we have exactly a random walk in the energy 
landscape given by the binding energy — in equilibrium, 
Prob(S*) oc e-^( s ). 

Now consider a population size N > 1, keeping 1=1. 
As long as N is sufficiently small (N <C 1/r), the chance 
of a mutation happening in any single iteration is very 
low. When a mutation happens, it will almost cer- 
tainly either "die out" (disappear from the population) 
or spread through the whole population before the next 
mutation occurs, so that most of the time the population 
consists of N identical DNA molecules. During selection, 
the chance of choosing a particular combination of DNA 
molecules is proportional to the product of their bind- 
ing constants. Thus, if there are m mutants of type i in 
the population at iteration i, the chance that there will 
be m + j mutants in the population the next iteration 
is (AKi) 2: > times the chance that there will be m — j 
mutants (as we select exactly half the DNA, the combi- 
natorics are identical). The probabilities p±(m) that m 
identical mutants in a population of size N will spread 
through the whole population (p+) or will die out (p~) 
then satisfy 



p£ (N — m) = (AKi) p_ (m) 



(2) 



Including the probability that a mutant will be created 
and selected in the first place, we find the rate Ri at 
which a single mutation of type i will cause the whole 
population to be replaced: 



Ri — 



AKi N n ,Nr (AK t ) 2N 
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FIG. 2: Average number of different bases between DNA pool 
and WT as function of time, and the exponential fit. r = 
10~ 4 , N = 256, 1 = 1, averaged over 8192 runs. 



The average time needed to improve the DNA pool by 
one base relative to WT can now be estimated as 



(T) 



1 



AKi 



log(AQ 



(4) 



(3) 



where Eq. and the condition p^l(m) + p_(m) = 1 are 
used, and the approximation is valid for \2N log AKi\ >> 
1. The population again describes a random walk, but 
this time the energy landscape is 2N — 1 times the bind- 
ing energy of a single DNA molecule — in equilibrium, 
Prob(7V copies of S) oc e -0(^-i)E(S)_ 



where AK is a typical value for AKi. The first term 
is the time required to create a "seed" mutation: The 
sum is over all possible correct mutations, ^ is the 
chance of each mutation occurring in a given iteration, 
and 1 — is the chance that the mutation survives. 
The second term is the time required for the mutation to 
spread through the population (for each mutant, n^y: 
is the average number of surviving children after one it- 
eration), which is negligible for small N. Since the first 
term, which dominates, is inversely proportional to the 
distance from WT (i.e. the number of terms in the sum), 
the average speed of the DNA pool will be proportional 
to that distance. Fig. [3 shows the distance from WT as 
a function of time, and except in the beginning, it can be 
almost perfectly fitted to an exponential — this is similar 
to the result in 0, which is for a very different regime. 
The corrections for the beginning are precicely what we 
would expect from the second term: It reduces the speed 
when the speed is large and causes a short delay. Our 
result for the evolution speed in the random walk (RW) 
regime is very similar to the one found in ;§J for a birth- 
rate model: As long as the second term of equation 0] is 
negligible, the evolution speed of the DNA population is 
proportional to the mutation rate and to the population 
size N. 

Given a sequence 5*, the chance that mutation i will 
be the next surviving favorable mutation (in the limit of 
small Nr and large N) is simply 



Mutation 
volvcd. 



'type" specifies both the position and the bases in- 



PRWmut(S, i) — 
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AKi 
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TABLE II: Lower bounds for the mean field regime. 



0.0005 0.001 0.0015 

Calculated probability of path 

FIG. 3: Probabilities of individual minimum paths in random 
walk regime — observed vs. predicted, r — 10~ 7 , N = 512, 
1 = 1, averaged over 131,072 runs. 

where the sum is over all possible favorable mutations 
to S. The chance of a given path tt will thus be 
-PwpathW = Iln PRWmvLt(SZ,TT n ). Fig. |3| compares 
these predicted values with simulation results (which are 
Poisson distributed), and as we can see, they agree very 
well — the total observed normalized variance is 745.2, vs. 
expected 720. The approach to the small Nr limit for the 
fraction of minimum paths out of all the paths used, is 
shown in Fig. ^ 

When I > 1, the mutations with large AK will be sig- 
nificantly more likely to happen first, as they are more 
likely to survive the now stronger selection. The prob- 
ability of each path can still be calculated (in the same 
limit as before) , but this is far more complicated than for 
the case of / = 1. 

The random walk approach clearly doesn't work when 
the second term of equation 0] exceeds the first term: A 
mutation will not have time to spread through the whole 
population before the next favorable mutation occurs. 
Large N: Mean Field. For sufficiently large N, we 
expect mean field (MF) behaviour — there is no signifi- 
cant difference between one experiment/simulation and 
another, and each path contributes a fixed fraction to 
the total WT DNA produced. In principle, in the limit 
of N — > oo the fraction of population S at time t, 
f(S,t) can be traced from iteration to iteration and the 
chemical potential determined from the selection cvite- 

TionJ2sf( s ^+ 1 / 2 )/{ 1 + ex P( E (S)~K t ))} = 2_J > where 
f(S, t + 1/2) is the fraction of S just after the amplifica- 
tion but before the selection. This would require tracing 
all possible sequences — a tedious and often impractical 
task. In practice, it suffices to restrict ourselves to a lim- 
ited set of paths. With a fixed set of paths, we can find 
the fraction of the DNA at each step of each path at each 
iteration, as well as the chemical potential used for each 
selection. The calculations proceed the same way as for 
the finite N simulations, except there is no randomness 
involved, and we discard the DNA that departs from the 
chosen paths. 



From Fig. we found that for large N, almost all the 
WT is produced through minimum paths. Looking at the 
full parameter range we have explored, for r = 10~ 4 and 
1 — 20 the contribution to WT through minimum paths 
is about 98% for the largest N we can simulate, and for all 
other parameters it's well above 99% for large N — thus 
we expect any set that includes all the minimum paths 
to give a reasonable result. To get even more accurate 
results, we allow one erroneous mutation (this already 
increases the number of paths from 720 to 156,960) and 
verify that this is only a minor correction. 

SS 7T 

Let n Q ' be the amount of WT produced from one 
molecule through tt, i.e. number of WT sequences at time 
£m originating from a single molecule of sequence SS ex- 
isting at the beginning of the experiment, through any 
specific path tt. Once we know all the chemical poten- 
tials n(t) used for the selections, we can use a set of re- 
cursion relations to find the average («g S:7r ) and the vari- 
ance Var(ng S,7r ), as well as the probability P + (nQ S,7r ) = 
P(n^ s '^ > 0) that a single SS molecule at time t = 
will yield some nonzero amount of WT at t m through tt 
(details in the appendix). 

In the mean field regime, the amount of WT produced 
through each minimum path should be relatively con- 
stant from one experiment to the next, which gives us a 
lower bound for the population size: 

jV>max Va t" S " ) . (6) 

Table ILTl shows these bounds for a selection of parameters 
I and r. The dependence on r is roughly N ~ r _m , where 
m = 6 is the number of different bases between SS and 
WT. 

The speed of evolution in the mean field regime can be 
easily estimated in the case of I = 1. With 1=1, half 
the population is removed during each selection, thus the 
chemical potential for the selection will be close to the 
median binding energy in the population, and any DNA 
with significantly higher binding energy than the major- 
ity will almost certainly survive. In the very first itera- 
tion of the evolution process, a fraction r/3 of the DNA 
will get each of the m "correct" mutations. In the fol- 
lowing iterations, almost all of these improved DNA will 
survive the selections, and the amount of improved DNA 
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FIG. 4: Average number of different bases between DNA pool 



and WT as function of time. 
1 = 1. 
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FIG. 5: The predicted average contribution from each path 
for the various regimes, plotted against each other and, for 
the middle ground, compared to simulation results (average 
of 2 23 simulations with I = 1, N = 2 24 and r = 1CT 7 ). 



will thus roughly be doubled in each iteration. After 

logf— ) 

toy mr } 



T «l 



log(2) 



(7) 



iterations, the improved DNA will have replaced the orig- 
inal population, i.e. most of the DNA will only have m— 1 
errors relative to WT. 

Once the whole population has been improved by one 
base, the process is repeated. However, there have al- 
ready been To iteration in which the improved DNA could 
improve further through mutations, and this even more 
improved DNA has been amplified at the same rate as 
the regular improved DNA, i.e. the "seed fraction" of 
improved DNA will now be ^ To(m 3 1)r (the factor \ is 
because only the copy gets mutated in each amplifica- 
tion). The time required to improve the DNA pool by 
one base is thus roughly 



T(m') « 1 + 



!og(- 



log(2) 



(8) 



where m! is the number of errors left. Corrections in- 
clude that some of the improved DNA will be lost during 
selection, reducing the effective amplification from 2 to 
Ajr+i ' anc ^ higher order corrections to the factor T in 
the seed fraction. These and other corrections can be 
addressed by considering an infinite length model. 0] 

Fig.Q]shows how the average number of errors changes 
with time, and we see that the evolution speed is almost 
constant — using T(2) from equation [H] gives a very good 
fit. The first improvement takes somewhat longer, as 
expected, and so does the last improvement: For most 
of the DNA, the error at position 4 will be the last to 
be corrected, and the binding energy difference for that 
mutation is much smaller than for the others, thus the 
effective amplification is significantly smaller. 

The effects of large I and various analytical estimates 
for the mean field regime are discussed in the appendix. 



The Middle Ground. The argument used for the evo- 
lution speed in the mean field regime is qualitatively 
valid for all N > - (the factor T in the seed fraction 
does not fully apply until N > 4f), thus we expect a 
smooth transition from the RW evolution behavior, ran- 
dom with on average exponential approach, to the mean 
field behavior, constant speed. However, while all the key 
quantities shown in fig. ^ vary smoothly with the pop- 
ulation size, there is a significant region where ~1 
and f^J t > 0.5, i.e. typically a single minimum path 
dominates. In this region the average contribution of 
the different paths vary far more than in the RW and 
MF regimes (Fig. and the region can be considered 
a third parameter regime. Fig. [5] shows the result of a 
simulation in the middle ground regime, and while there 
in this simulation are 11 minimum paths that contribute 
WT, 70% of it comes from a single path. Note that all 
the minimum paths used here are "probable" ones (fairly 
red). 

As a single minimum path dominates in each simu- 
lation, the average contribution of the different paths 
should depend strongly on how often that path domi- 
nates. Given that the overall evolution behavior is sim- 
ilat to that of the mean field regime, we can use as a 
first guess the probabilities P^ IF (nQ S ' 7r ) from the large 
N discussion (we here add the superscript MF to em- 
phasize that these are mean field calculations). Fig. [S] 
shows the results from simulations plotted against this 
estimate, and there is a very accurate relationship be- 
tween them (it is not linear), at least for the most prob- 
able paths — for the less probable paths, statistical errors 
are large. The middle ground region corresponds fairly 

well to i < N < (E,r i + IF (™o S ' 7r ))~ 1 : Le - the re S ime 
ends approximately when the population is large enough 
that, using the mean field chemical potentials, we would 
expect to find at least some WT in most simulations. 
There is a very large crossover region between the MF 
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FIG. 6: Tree of evolution to WT (the root is SS) for a sample 
simulation with TV = 2 20 , r — 1CP 4 , and 1 = 1. Node size 
shows the sum over all the iterations of the amount of the se- 
quence (from that path) present during the simulation. Leaf 
size (square) shows the amount of WT produced (all nodes 
annotated with a square are WT). The color of the squares 
show the probability of that path using the mean field calcu- 
lation. 



regime and the middle ground, and a smaller region be- 
tween the middle ground and the RW regime (Fig. [1J. 

Fig. also shows the predictions (or "best guess" 
for the middle ground) for the various regimes plotted 
against each other, and it is clear that they are very dif- 
ferent, though they are somewhat correlated. 
Experimental Signatures. It is difficult to com- 
pletely and directly test the above theoretical analy- 
sis experimentally — one would need to sequence a large 
number of DNA. A more practical choice is to consider 
only the distance from a sequence to WT, i.e. the number 
of positions at which they differ, and study its variance 
in different regimes. In the random walk regime, at most 
times the DNA pool consists of only a single sequence, 
thus the variance of this distance is, at any given time in 
any given run, almost zero, but the variance from one run 
to the next can be very large. As we increase the popula- 
tion size, the variance within a run increases somewhat, 
but the variance between runs decreases drastically, and 
above the middle ground we have almost perfect coher- 
ence (Fig. 01. 

Fig. |H1 shows how the distribution of the distances 
changes through (a part of) a simulation, and it confirms 
our earlier assumptions: Only once the whole population 
has been "upgraded" does further improvement start to 
become significant, i.e. there are at most two distances 
with significant population at any one time. 
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FIG. 7: The variance of the number of errors relative to WT 
as a function of the time (iteration), r = 10~ 4 and 7=1 for 
all graphs, while TV = 2 9 for a), TV = 2 18 for b) and TV = 2 32 




FIG. 8: Distribution of distances from WT as funtion of time, 
r = 10 -4 , 1 = 1, TV = 2 39 « 5.5 ■ 10 11 



III. CONCLUSION 

Our simulation and analysis show that in the simple 
case of additive binding energy the evolution behavior of 
DNA-protein binding can be understood quantitatively 
and rather completely. Depending on the population size 
and the mutation rate, the evolutionary process exhibits 
distinct behaviors in three parameter regimes. Our re- 
sults are fairly general as long as the potential is mainly 
additive and can be used to make sense of experimental 
data. 

It is noteworthy that evolutionary processes via molec- 
ular breeding such as the one discussed here are funda- 
mentally different from those where the fitness of an indi- 
vidual depends solely on itself (e.g. on its genotype) and 
does not depend on others in the population. In the lat- 
ter case the fitness landscape is fixed [H El EB E3 an d 
the more fit the better, while in the former case the fit- 



ness landcape is dynamic and you just have to be bet- 
ter than the average — to be even better does not increase 
your fitness. The two cases can lead to, for example, very 
different evolution dynamics for large population sizes [5j . 

The additivity of the binding energy gives rise to a 
smooth landscape, which greatly simplifies the analysis. 
The inclusion of a small perturbative non-additive part 
to the potential would not change the picture, but would 
nonetheless provide insights to the cases of more general 
potentials and fitness functions. 

We thank Qi Ouyang and Terry Hwa for very helpful 
discussions. 



IV. APPENDIX 

A. Approximations 

Mutation rates. In the main paper we assumed that 
all mutation rates were the same regardless of which 
base (A,C,G or T) was mutated and which base it mu- 
tated into, while in reality these rates can be very dif- 
ferent [T^ . I20I l2l| . While this assumption was done to 
simplify the analysis, the mutations rates will depend 
on the specific experimental conditions under which the 
PCRs are performed, and it might thus not have sufficed 
to choose just one set of base dependent mutation rates. 

The simulations, the mean field recursion relations and 
the analysis of the random walk regime can all easily be 
altered to include different mutation rates. Note that the 
exact results for the equilibrium distributions in the ran- 
dom walk regime only hold (in the correct limit) when 
mutation rates are the same for opposite mutations. The 
evolution speed for the mean field regime is also still es- 
sentially correct, although the least likely mutations will 
typically occur last and will take longer, as the seed frac- 
tion is smaller (this is also addressed in the infinite length 
model [Ts|l. 

Non-specific binding. In the main paper we always 
started out with a DNA sequence that binds strongly 
to the protein, such that favorable mutations can be se- 
lected quickly. If we start out with a very poor sequence, 
or, in the simulations, increase the non-specific binding 
strength of the protein, then even highly favorable muta- 
tions will be only weakly selected, as non-specific binding 
will dominate. 

Figure [5] shows how the average number of errors 
changes through the iterations when we start with a se- 
quence with 10 errors (the usual ones and 4 more) and 
various values for the non-specific binding strength. For 
high non-specific binding strengths, none of the muta- 
tions are selected strongly, and "nothing happens" until, 
by chance, several very good mutations combine, yielding 
a sequence that binds significantly more strongly than by 
non-specific binding alone - the DNA pool then abruptly 
improves by all those mutations at once, and the process 
then proceeds as usual. Also, the original sequence now 
contains several (3) very weak errors, and once only these 




Iteration 

FIG. 9: Average distance from WT as a funtion of time for 
various values of non-specific binding, r = 10 -4 , 1 — 1, N = 
2 33 « 8.6 ■ 10 9 



remain, the "one mutation at a time" assumption fails. 

Note that a very large population size is required in or- 
der to be certain that a sufficiently good sequence will be 
produced (exponentially increasing the further into the 
non-specific binding regime we are) , and the random walk 
regime no longer exists - for small N , most experiments 
would never generate the WT sequence. 



B. More on Mean Field 

Mean Field Recursion Relations. Here we derive a 
set of recursion equations in the MF regime. We assume 
that the number of iterations <m and the chemical po- 
tentials /j,(t) are known from a mean field simulation, as 
discussed in the main paper. 

Let nf' 71 be the number of WT sequences at time 
£m originating from a single molecule of sequence S ex- 
isting a time t in the experiment, through a specific 
path 7T (which takes S to WT). Let P+(nf ,7r ) be the 
chance that the molecule produces some nonzero amount 
of WT through n, (n t '™) be the expected amount, and 
Var(nf ,7r ) = ((nf^) 2 ) - (nf'*} 2 be the variance. 

Given these quantities for time t' right after a selection, 
we can compute the values for time t right before the 
selection: 

p+(nfn = Pt, s p + (4n (9) 

(nfn = PtMn (10) 

((nf'T) = PtM'*?) (n) 

Pt > S = l + e S(S)-M(0 ^ 

Similarly, given the quantities for t' right after a cycle 
of PCR, we can compute the values for time t right before 



the PCR: 



(n t ) 



P (4n [l-(l-r) L P + (n*n 
--{l-r) L ^P + {nf,^)_ 



(13) 
(14) 



where S' is the sequence reached by performing the first 
mutation in path tt on S, n' is the remaining path, and 
P (x) = l-P+(x). 

The first term of equation 1131 Po(np v ), is the chance 
that the original, unchanged DNA molecule does not pro- 
duce any WT at time tyi (through path %). The second 
term is the corresponding chance for the possibly mu- 
tated copy: If it was not mutated (chance (1 — r) L ), the 
chance of producing WT through path tt is P + (n^ n ). If 
it was mutated exactly the right way, i.e. according to 
the first step of path tt (chance |(1 — r) i_1 ), the chance 
of producing WT through the remaining part n' of path 

7T IS P+(n^ ), while any other mutation would mean it 
is off the right path. 

Eauation ll4l is straightforward: (n^' 11 ) is the contribu- 
tion from the unchanged molecule, and the other terms 
are for the possibly mutated copy. The equation for 
((nf ,1T ) 2 } follows immediately. These equations cover the 
case where we only allow single mutations; there are addi- 
tional terms when we allow multiple mutations (i.e. mul- 
tiple steps along the path) . 



We trivially know that P+(n^ T ' ) 



- / WT>-\ - 



((rtJ^ T ' - ) 2 ) = 1, and the values are zero for all other 
sequences at that time. By applying the above equations 
for all selections and PCRs in the experiment, in reverse 
order, we find the desired quantities for time t = 0. The 
results of this calculation for r = 10~ 4 and I — 1 are 
shown in Fig. 1771 

Large /. As discussed for 1 = 1, evolution works by 
first producing a seed fraction of improved DNA which 
is then amplified through subsequent iterations. Unlike 
the 1 = 1 case, we can no longer consider all improved 
DNA (with a given number of errors) to be equivalent: 

The maximum amplification (in one iteration) for DNA 
that is better than the majority of the population is 
2 1 . However, for this amplification to actually occur, 
the binding constant of the DNA that is to be amplified 
must be at least 2 1 times that of the typical DNA in the 
population. For 2 1 3> AK, the amplification will simply 
be AK per iteration. 

The highest possible amplification in our model is thus 

TciSsj ~ ^ 15 ' ^ or ver y smau I we expect the number of 
iterations required to improve by one base (equations 7 
and 8), and thus the total number of iterations 4m , to be 
approximately inversely proportional to / (this holds for 
all / for the infinite length model [l^)- As we increase 
/, the various amplifications are gradually saturated (ap- 
proach AK), and for / > 15 they are all saturated. In- 




100 150 200 

# PCRs per iteration (I) 

FIG. 10: Lower bound for mean field regime calculated from 
recursion relations vs. analytical estimate, as function of I. 
r = 10~ 4 . 



creasing / beyond this only serves to increase the effective 
rate of mutation. Table ITTTI shows how £m depends on /, 
and this matches our expectations well. 



I 


1 


2 


4 


6 


10 


15 


20 


30 


50 


100 


tm 


151 


80 


46 


36 


28 


25 


22 


22 


21 


20 



TABLE III: Number of iterations for mean field simulations. 
r = 10~ 7 for all values. 

For large /, the amplification of a given DNA sequence 
in an iteration is approximately the ratio of its binding 
constant to that of the typical DNA in the population, 
thus DNA that is much better than the average will be 
amplified far more strongly. In particular, DNA that 
mutates to WT during the very first iteration will receive 
the maximum possible amplification, and this process is 
the one that contributes the most WT in a simulation. 

We can estimate this contribution and the corrections 
due to DNA that does not quite reach WT the first iter- 
ation, as well as the variation from this process. These 
calculations are carried out below and give us an estimate 
for the lower bound of the mean field regime for large /: 



N > max 



Var (n^y T (f M )) 
(*m)) 2 



x 



(15) 



Figure ITU1 compares this estimate (equation 12 1|) to the 
(highly accurate) values found from the recursion rela- 
tions (eqs.EEJ. 

Mean field estimates: Large /. When / is suffi- 
ciently large, the chemical potential for the selections is 
much lower than even the binding energy of WT, thus 
the chance that a DNA molecule will survive a selection 
is proportional to the binding constant of the sequence. 
The chance that each initial error will be corrected during 
any given iteration (/ cycles of PCR) is -y , but for each 
iteration that mutation i has not happened, the DNA 
will be amplified by a factor AKi less than those that 



9 



have reached WT. The number of WT molecules cre- 
ated (through all paths), as a function of time, is ap- 
proximately 



(nwT(*)) « fi(t)2 



it— i 



f)"n( T 



1 



(16) 

where i runs over the mutated positions, and fi(t) is a 
function describing the ratio of WT molecules that would 
have survived all the selections so far. The powers of 2 
is the total number of PCRs minus the number of cycles 
in which the DNA mutated (ignoring simultaneous mu- 
tations), and the last factor corrects for mutations that 
didn't happen the first iteration. 

The main source of variance is whether or not a start- 
ing molecule produces a WT molecule that survives the 
first selection: 



Var(r^ T (t M )) « (Kvt(*m)) 2 ) 



1) fl(tu) \ 2 2- r_m fir- 



2™ P V"tT C 1- ^^) -1 ) 

ml I Tr I 11 



AX 2 



(18) 



where we have used //(l) = 2 J ^is assumes that, 

before the selection for the first iteration, most of the 
DNA sequences have not been mutated, which fixes the 
chemical potential The first factor of eq. El is the 
average number of WT molecules that a WT molecule 
that survives the first selection would produce by time 
tyi , squared, and the remaining factors are the chance 
that such a molecule is produced; the expected number 
of WT molecules produced through a single path during 
the I cycles of PCR in the first iteration times the chance 
//(l) that each WT molecule survives the first selection. 

We eliminated fi{tyi) from eauationll7lbv using equa- 
tion [23 setting (uwt^m)) ~ 1- From this we can get 
a rough estimate for the lower bound of the mean field 
regime by using (^ t (<m)) rs ^-j, as for I = 1. However, 
for large / we can do better by calculating the relative 
contributions of individual paths directly: 



Ir 



(«wt(*)) * //W2"- m ( y ) 9m{-K)AK{-K) (19) 



9A) ~ It! + ^ J AK(n)-V 



(20) 



where TTj means the last j steps of path ir, and AK(n) 
is the total change in binding constant, i.e. the prod- 
uct of the AK for the individual mutations of path 7r. 
Knowledge of initial sequence is implicit. 

Using eauationll9lto eliminate // (t) in eauation ll7l we 
find the bound for the MF regime directly: 



2 m / 3 
TV > max — - — 
Tr m\ \ Ir 



AK{tt) 



(21) 



Mean field estimates: / = 1. A DNA molecule that 
has higher binding energy than the average will typically 
survive a selection, and vice versa. The simplest estimate 
we can use for the behavior of the average binding energy 
is a linear change from the binding energy of SS to that 
of WT). 

Given that a DNA molecule mutates from SS to WT 
during the tyi iterations of the simulation, we estimate 
that it will survive all the selections iff the number of 

The number of 



(t M )' 



-: There 



good mutations is always at least ^ 

ways this can happen is approximately to 
must be a mutation at the first iteration, giving the factor 
to and leaving approximately (iivi)'™ -1 combinations for 
the remaining mutations. Given to points randomly dis- 
tributed on a circle of length m, there is with probability 
1 exactly one point for which, when traveling clockwise 
from that point, you will always have visited more points 
than the length you have traveled (assuming 4m S> m, 
this continuum case should be good enough) — thus, the 

chance that we started from the right point/mutation is 
j_ 

m ' 

From this, the number of WT molecules produced from 
a single SS molecule in f m iterations is about 



(^wt(^m)) 



-itM-m 



/r\ m 

(3) W 



(22) 



where the powers of two are the number of iterations 
where the DNA didn't change (and thus was duplicated). 

The simulation ends when there are about as many 
WT molecules as there were SS molecules at the begin- 
ning: 



(«wt(*m)) 



(23) 



m log ; 



/ 3 



\MR 



(m- l)log 2 (t M ) 



This is very similar to T + EZ=i T ( m ')> 

using equa- 
tions 7 and 8 and log 2 (tM) ~ wlog 2 (T) — most of the 
discrepancy corresponds to the higher order corrections 
i ~(2) found for the infinite length model p^ |. 

Regarding the variance Var(n^ T (tM)) of the number 
of WT molecules produced from a single SS molecule 
through a single path tt, any duplication that occurs be- 
fore the WT is reached will only yield a factor of 2 — the 
two molecules must independently develop to the WT — 
while duplication once the WT is reached gives a factor 
of 4. Because of this, the variance is dominated by the 
extremely rare events where all the good mutations hap- 
pen very early (the first m + A iterations), and the WT 
molecules are then simply duplicated until the simulation 
ends: 



Var (tt-wt^m)) ~ ("wtI'm)) 2 

(to - 2 



-2m-A 



A>0 

2 2t M -m-l 



A)! 



A!(to-2)! 



(24) 
(25) 
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If the DNA is mutated to WT in the first m + A iter- 
ations (A is the number of "idle" iterations), it will then 
be duplicated £m — fn — A times, giving a contribution 
2 2t M -2m-2A Alg0j the DNA will be duplicated during 

each idle iteration, but this merely increases the chance 
that a molecule will reach WT, and thus gives a factor 
2 A . 

The number of ways the mutations can happen is 
, assuming that there must be a mutation the 
first iteration — we have ignored the possibility of multi- 
ple mutations in one cycle of PCR, which is a significant 
correction. Equation 1251 follows by using 

A>0 \«>0 / 

We can then use the average (n^ T (t M )) w = ^ to esti- 
mate (very roughly) the lower bound for the mean field 



regime. 

For r = 10~ 4 we estimate £m ~ 65 and thus 
Var (?Ivvt(*m)) « 5.8-10 10 . The actual values are £m = 83 
and Var (n^ T (in)) ~ 1.5 • 10 12 . The somewhat odd fact 
that <m is pretty far off, while the variance, which de- 
pends very strongly on tyi, is reasonably close, is not 
entirely unexpected: Towards the end of the simulation, 
many WT molecules will be selected away (as the chemi- 
cal potential for the selection approaches the binding en- 
ergy of WT), which will increase £m- However, this will 
also affect the variance correspondingly, such that using 
the solution from equation 1231 in equation 1251 will still 
give the right value. Considering that we have used an 
extremely simple approximation for selection (and com- 
pletely ignored the AK for the various mutations), the 
estimate seems very reasonable. As a check, when using 
very large energies for the mutations (while keeping the 
ratios fixed), the simulations yield £m = 65 for the mean 
field regime. 



[1] Bittker, J.A., Phillips, K.J. & Liu, D.R. (2002) Curr. 33. 

O-pin. Chem. Biol. 6, 367-374. [21] Lin-Goerke, J., Robbins, D. & Burczak, J. (1997) 

[2] Landweber, L.F. (1999) Trends Ecol. Evol. 14, 353-358. Biote.chniqu.es 23, 409-412. 

[3] Arnold, F.H., ed. (2001) Evolutionary Protein Design 

(Academic Press, San Diego). 
[4] Novella, I.S., Duarte, E.A., Elena, S.F., Moya, A., 

Domingo, E. & Holland, J.J. (1995) Proc. Natl. Acad. 

Sci. USA 92, 5841-5844. 
[5] Tsimring, L., Levine, H. & Kessler, D.A. (1996) Phys. 

Rev. Lett. 76, 4440-4443. 
[6] Kessler, D.A., Levine, H., Ridgway, D. & Tsimring, L. 

(1997) J. Stats. Phys. 87, 519-544. 
[7] Ridgway, D., Levine, H. & Kessler, D.A. (1998) J. Stats. 

Phys. 90, 191-210. 
[8] Peng, W., Gerland , U., Hwa, T. & Levine, H. (2002) 

cond-mat/0204117 
[9] Dubertret, B., Liu, S., Quyang, Q. & Libchaber, A. 

(2001) Phys. Rev. Lett. 86, 6022-6025. 
[10] Fields, D.S., He, Y., Al-Uzri, A. & Stormo, G.D. (1997) 

J. Mol. Biol. 271, 178-194. 
[11] Stormo, G.D. & Fields, D.S. (1998) Trends in Biochem- 
ical Sciences 23, 109-113. 
[12] Man, T.-K. & Stormo, G.D. (2001) Nucl. Acid. Res. 29, 

2471-2478. 

[13] Record, M.T. Jr., deHaseth, P.L. & Lohman, T.M. (1977) 

Biochemistry 16, 4791-4796. 
[14] Winter, R.B. & von Hippel, PH. (1981) Biochemistry 20, 

6948-6960. 

[15] Frank, D.E., Saecker, R.M., Bond, J.P, Capp, M.W., 

Tsodikov, O.V., Melcher, S.E., Levandoski, M.M. & 

Record, M.T. Jr. (1997) J. Mol. Biol. 267, 1186-1206. 
[16] Peliti, L. (1997) cond-mat/9712027 

[17] Woodcock, G. & Higgs, P.G. (1996) J. Theor. Biol. 179, 
61-73. 

[18] Kloster, M. unpublished. 

[19] Shafikhani, S., Siegel, R., Ferrari, E. & Schellenberger, 

V. (1997) Biotechniques 23, 304-310. 
[20] Cadwell, R. & Joyce, G. (1992) PCR Meth. Appl. 2, 28- 




Probability color scale (leaves): 



-11 



-10 -9 -! 

Mutation energy scale (edges): 



log 10 (P + (ng s '*)) 



AK 



1 2 3 4 5 6 7 

Average #WT scale: 



9 10 11 12 



<n^ s ' It > 



id 



£ a 

> 



•r n 

O O 

CD TJ ^ 

s I II 

2 ■§ «- 
o 



o 
> 
'5b 



3 

s 

O S S-l 

£ « * 

<P cp 

o_, i—H CC 

CP 

CP += 0) 



t3 cd 
.3 

o o 

CD 



2e-4 5e-4 le-3 2e-3 5e-3 0.01 0.02 0.05 0.1 0.2 0.5 1 



*-< .a - 5 



CD 

s 'S 

o 

CD 
!>> 



